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Abstract 

Background: Motifs are patterns found in biological sequences that are vital for understanding gene function, 
human disease, drug design, etc. They are helpful in finding transcriptional regulatory elements, transcription factor 
binding sites, and so on. As a result, the problem of identifying motifs is very crucial in biology. 

Results: Many facets of the motif search problem have been identified in the literature. One of them is (£, d)-motif 
search (or Planted Motif Search (PM5)). The PMS problem has been well investigated and shown to be NP-hard. Any 
algorithm for PMS that always finds all the (£, c/)-motifs on a given input set is called an exact algorithm. In this 
paper we focus on exact algorithms only. All the known exact algorithms for PMS take exponential time in some 
of the underlying parameters in the worst case scenario. But it does not mean that we cannot design exact 
algorithms for solving practical instances within a reasonable amount of time. In this paper, we propose a fast 
algorithm that can solve the well-known challenging instances of PMS: (21, 8) and (23, 9). No prior exact algorithm 
could solve these instances. In particular, our proposed algorithm takes about 10 hours on the challenging instance 
(21, 8) and about 54 hours on the challenging instance (23, 9). The algorithm has been run on a single 2.4GHz PC 
with 3GB RAM. The implementation of PMSS is freely available on the web at http://www.pms.engr.uconn.edu/ 
downloads/PMSS.zip. 

Conclusions: We present an efficient algorithm PMSS that uses some novel ideas and combines them with well- 
known algorithm PMSl and PMSPrune. PMSS can tackle the large challenging instances (21, 8) and (23, 9). 
Therefore, we hope that PMSS will help biologists discover longer motifs in the futures. 



1 Background 

The discovery of patterns in DNA, RNA, and protein 
sequences has led to the solution of many vital biological 
problems. For instance, the identification of patterns in 
nucleic acid sequences has resulted in the determination 
of open reading frames, identification of promoter ele- 
ments of genes, identification of intron/exon splicing 
sites, identification of SH RNAs, location of RNA degra- 
dation signals, identification of alternative splicing sites, 
etc. In protein sequences, patterns have proven to be 
extremely helpful in domain identification, location of 
protease cleavage sites, identification of signal peptides, 
protein interactions, determination of protein degrada- 
tion elements, identification of protein trafficking ele- 
ments, discovery of short functional motifs, etc. Motifs 
are patterns found in biological sequences that are vital 
for understanding many biological subjects like gene 
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function, human disease, drug design etc. As a result, the 
identification of motifs plays an important role in biologi- 
cal studies. The motif search problem has been attracting 
many researchers. In the literature, many versions of the 
motif search problem have been enumerated. Examples 
include Simple Motif Search (SMS), Planted Motif Search 
(PMS) - also known as (£, d)-motif search, and Edit-dis- 
tance-based Motif Search (EMS) (see e.g., [1]). In this 
paper, we will focus on the PMS problem (or PMS for 
short). 

The definition of Planted Motif Search (PMS) 

PMS is stated as follows. It takes as input n sequences, two 
integers i and d. For simplicity, we assume that the length 
of each sequence is m. The problem is to identify all 
strings M of length £ such that M occurs in each of the n 
sequences with at most d mismatches. Formally, string M 
has to satisfy the following constraint: there exists a string 
Mi of length / in sequence i, for every i (1 < / < «), such 
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Table 1 A comparison between Af{£, d) and £[Bd(x, y, z)] 
for different values of £ and d 
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3.12 X 10 
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1.99 X 10^° 
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that the number of mismatches between M and M, is less 
than or equal to d. The number of mismatches between 
two strings of equal length is known as the Hamming dis- 
tance between them. String M is called a motif. For exam- 
ple, if the input sequences are GCGCGAT, CAGGTGA, 
and CGATGCC; £ = 3; and d = 1, then GAT and GTG are 
some of the {I, d)-motiis. PMS is a well-studied problem 
and it has been shown to be NP-hard. As a result, all 
known exact algorithms for PMS take exponential time in 
some of the underlying parameters in the worst case. Two 
kinds of algorithms have been proposed in the literature 
for PMS: exact and approximate. While an exact algorithm 
always finds all the motifs, an approximate algorithm may 
not always find all the motifs. Typically, approximate algo- 
rithms tend to be faster than exact algorithms. Some exam- 
ple approximate algorithms are due to Bailey and Elkan [2], 
Buhler and Tompa [3], Lawrence et al. [4], Pevzner and Sze 
[5], and Rocke and Tompa [6]. These algorithms are based 
on local search techniques such as Gibbs sampling, expec- 
tation optimization, etc. The WINNOWER algorithm of 
[5] is based on finding cliques in a graph. Some other 
approximate algorithms are: PROJECTION [3], MULTI- 
PROFILER [7], PatternBranching [8], CONSENSUS [9], 
GibbsDNA [4], MEME [2], and ProfUeBranching [8]. 

Although approximate algorithms are acceptable in 
some cases in practice, exact algorithms are preferable 
since they are guaranteed to report all the {/, (i) -motifs. 
For biologists, the motifs found by an algorithm could be 
much more important than its run time. As a result, we 
focus in this paper on efficient exact algorithms. Some 
exact algorithms known for PMS are: [10-18], and [19]. 

Buhler and Tompa [3] have employed PMS algorithms 
to find known transcriptional regulatory elements 
upstream of several eukaryotic genes. In particular, they 
have used orthologous sequences from different organisms 
upstream of four different genes: preproinsulin, dihydrofo- 
late reductase (DHFR), metallothioneins, and c-fos. These 
sequences are known to contain binding sites for specific 
transcription factors. The authors point out the differences 
between experimental data and synthetic data that PMS 



algorithms are typically tested with. For example, the back- 
ground DNA in experimental data is not random. Their 
algorithm successfully identified the experimentally deter- 
mined transcription factor binding sites. They have used 
the values of 1 = 20 and d = 2. The same sites have also 
been found using our PMS2 algorithm [11]. The algorithm 
of [3] is an approximation algorithm whereas PMS2 is an 
exact algorithm. Buhler and Tompa have also employed 
their algorithm to solve the ribosome binding site problem 
for various prokaryotes [3]. This problem is even more 
challenging since here the number of input sequences 
could be in the thousands. 

Eskin and Pevzner [13] used PMS algorithms to find 
composite regulatory patterns. They point out that tradi- 
tional pattern finding techniques (on unaligned DNA 
sequences) concentrate on identifying high-scoring mon- 
ads. A regulatory pattern could indeed be a combination 
of multiple and possibly weak monads. They employ 
MITRA (a PMS algorithm) to locate regulatory patterns 
of this kind. The algorithm is demonstrated to perform 
well on both synthetic and experimental data sets. For 
example, they have employed the upstream regions 
involved in purine metabolism from three Pyrococcus 
genomes. They have also tested their algorithm on four 
sets of S.cerevisiae genes which are regulated by two 
transcription factors such that the transcription factor 
binding sites occur near each other. Price and Pevzner 
[8] have employed their PatternBranching PMS techni- 
que on a sample containing CRP binding sites in E.coli, 
upstream regions of many organisms of the eukaryotic 
genes: preproinsulin, DHFR, metallothionein, & c-fos, 
and a sample of promoter regions from yeast. They 
report finding numerous motifs in these sequences. 

The performance of an exact algorithm is typically eval- 
uated on random benchmark data generated as follows. 
Twenty input DNA sequences, each of length 600, are 
generated randomly from the alphabet 2 = {A, C, G, T}. A 
motif M of length £ is also generated randomly and 
planted in each of the input sequences within a Hamming 
distance of d to make sure that there always exists a motif 
in the input. Based on the values of £ and d, certain 
instances of PMS have been identified to be challenging. 
An instance is challenging if the expected number of the 
motifs that occur by random chance (in addition to the 
planted one) is one or more. For example, the following 
instances are challenging: (9, 2), (11, 3), (13, 4), (15, 5), (17, 
6), (19, 7), (21,8), (23, 9), etc. 

To compare the performance of exact algorithms, the 
challenging instances are commonly used. For example, 
the exact algorithm MITRA of [8] can solve the challen- 
ging instances (9, 2), (11, 3), and (13, 4). It takes either 
too much time or too much memory on the challenging 
instance (15, 5) or any larger instances. Both the exact 
algorithm Voting in [20] and the exact algorithm 
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RISOTTO in [21] can solve the challenging instances up 
to (15, 5). In most of the cases, Voting runs faster than 
RISOTTO. The best up-to-date exact algorithm is Pampa 
given in [10]. Pampa can solve the challenging instance 
(19, 7) within about 4.8 hours. The second best exact 
algorithm is PMSPrune [22] that can solve the challen- 
ging instance (19, 7) within about 10 hours. 

In this paper we present an exact algorithm (named 
PMS5) that can solve the challenging instances (21, 8) and 
(23, 9). PMS5 takes about 10 hours on the challenging 
instance (21, 8) and about 54 hours on the challenging 
instance (23, 9). These run times are on a single 2.4GHz 
PC with 3GB of RAM. To the best of our knowledge, no 
other exact algorithm can solve these instances. 

2 Methods 

2.1 Notations and Definitions 

In this section we introduce some notations and defini- 
tions that will help us describe our algorithm clearly. 

Definition 2.1. A string x = x[l] ... x[l] of length £ is 
called an l-mer. 

Definition 2.2. Given two strings x and y of equal 
length, we say the Hamming distance between x and y, 
denoted by dnix, y). is the number of mismatches 
between them, 

Definition 2.3. Given a string x = x[\\ ... x[V\, we define 
the d-neighborhood ofx, B^x), to be {y \ dnix, y) < d}. 

Note that \Bdix)\ = - 1)'' where S is the 

alphabet of interest. Notice that B^iix) depends only on 
£, d and For this reason, we define 

m,d) = j2lo (0(1^1 -I)'- 

Definition 2.4. Given two strings x = x{\\ ... x{i] and s 
= s [1] ... s[m\ with £ <m: 

1. We use the notation x & i s if x is a substring of s 
(equivalently, s = ax^ for some strings a and P). We 
also say that x is an t-mer of s. 

2. We define ~dH[x,s) = min,6^sdH(x r). 

Definition 2.5. Given a string x = x[l] ... x\V^ and a 
set of strings S = {si, ...,Sn}with |s, | = m for i = 1, n 
and £ <m, we define dH[x,S) = maxi<i<„dH(^, ^O- 

It is easy to see that x is an (£, (i) -motif of S if and 
only if dH{x,S) < d. 

Definition 2.6. Given a set of strings S = {si, ...,Sn], we 
define Mi4{S)to be the set of {I, d) motifs of S- 

The goal of PMS is to compute M{,d(5), given £, d 
and S. 

2.2 PMS5 - A fast algorithm 

The idea of our algorithm is based on the following 
observations about Mi4{S). 



Observation 2.1. Let S, S'and S"be three sets of 
strings such that S = S' S"and S' n S" = 0- ^t is easy to 
observe that Mi,d{S) = Mi,d{S') n Mi,d{S"). 

Observation 2.2. For any string s, 

= M Ba[x). 

From Observation 2.1 and Observation 2.2, we can 
obtain the following observation. 

Observation 2.3. Let S* = S\{si} = {si, ...,s„}. We have 
Mi4{S) = \\ [BA[x)r^Ml4{S*)\ 

Observation 2.3 tells us that Mi4{S) can be computed 
imm Bd[x) r^ Mi4[S*). 

Without loss of generality, we can assume that the 
size of S* is even, i.e., |iS*| = n — \ = 2p, for some inte- 
ger p. Otherwise we can add a copy of s„ into S*, in 
which case Mc4[S*) will remain the same. Next, we par- 
tition S* into pairs of strings Si,...,Sp, where 
Sk = {S2k,S2ii+i] for ^ = 1 ... p. From Observations 2.1 
and 2.2, we can make the following observations. 

Observation 2.4. 

Bd{x) n M,4{S*) = ni<fe<p [Bdix] n Mi4[Sk)]. 
Observation 2.5. 



\\ [Bj{x)nB4Y)nBi{z)]. 



Based on the above observations, we note that the pro- 
cess of computing Mi^^S) can be reduced to computing 
Bdix, y, z) = Bd{x) n BJj) n Ba{z) repeatedly. We will dis- 
cuss how to compute BJ^x, y, z) efficiently in Section 2.2.2. 
The pseudocode of our algorithm PMS5 is given below. 

Algorithm 

PMS5 

for each ;>c e ^ 5i do 

n — 1 



for /c = 1 to p ■ 



do 



5 
6 

7 

8 

9 

10 

11 

12 

13 

14: 

15 

16 

17 

18 

19 



Q ^ 0. 

for each y & t S2k and z e £ S2k+i do 

Compute Ba{x, y, z) = B^ix) n B^iy) n B^iz). 
Q ^ Q U Baix, y, z). 

end for 

if A: = 1 then 

Q'^ Q. 
else 

Q' ^ Q' n Q. 
end if 

if IQ'I is small enough then 

break the for loop, 
end if 
end for 

for each re Q' do 
ii~dH{r,S) < dthen 
Output r as an (£, d) motif. 
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20: end if 
21: end for 
22: end for 

In the pseudo code, the process of computing 
Bdix) n d(iSfe) for each k is from Une 3 to Une 7. After 
line 7, Q is actually Bd{x) n Mi^diSk)- Within the loop at 
line 2, Q' is Bd{x) n Mt4{Si) n • • • n Mi4{Sk) for each k 
after line 12. At line 13, if \Q'\ is less than a certain thresh- 
old, the algorithm simply exits the loop and will not try 
other values of k. In practice, we set the threshold to be 
between 5000 and 10000. From line 17 to line 21, the algo- 
rithm checks if each string r L Q' is actually an (£, li) -motif 
or not. To check if duir, S) < d for any r, we only have to 
use the remaining sequences {s2k+2> S2k+3> ■■■> Sn)- 

2.2.1 Analysis 

PMS5 is correct From the observations, it is straightfor- 
ward to see that PMS5 outputs Mt^d[S). Therefore, 
PMS5 is correct. 

The worst-case run time of PMS5 Theorem 2.1. The 

worst-case run time of PMS5 is 0{nm^dJ\f{£, d))- Recall 

thatM{l,d) = \Bdix)\ = Eto (0(1^1 - !)'• 

Proof. It is easy to see that the run time of PMS5 is 
dominated by the computation time of B^{x, y, z) in line 
5. In Section 2.2.2, we will show that BJ^x, y, z) can be 
computed in 0(£ + d\Bii{x, y, z)\) time. In the extreme 
case in which x = y = z, \Bd{x,y,z))\ = |Bd(x)| =J\f{£,d). 
Since J\f{£, d) is much larger than £, the computation 
time of B^{x, y, z) is 0{dN'{£,d)). Also, it is straightfor- 
ward to see that the number of times BJ^x, y, z) is com- 
n o 

puted is at most —[m — £ + 1) . Hence, the run time of 
PMS5 is 0{nm^dAf{£,d)). 

The expected run time of PMS5 We can compute the 
expected run time of of PMS5 by computing the 
expected value of B^ix, y, z). Let x, y, and z be three 
random £-mers. How many £-mers are there that are at 
a distance of < d from each of x, y, and z? Let u he a 
random £-mer. 

Prob.[d„(x,«) < d] = = (0(3/4)' (1/4)'-'- This 

means that 
Prob.[dH(x, m) < d&dniy, u) < d&dniz, u) < d] = p^^. 
Therefore, the expected number of u's such that u is at 
a distance of < d from each of x, y, and z, E[B^(x, y, z)], 

As a result, the expected run time of PMS5 is 

O (nm'dAy,,). where p,., = (f) (3/4)'(l/4)^-'. 

Table 1 gives a comparison between J\f[£, d) and £[5^ 
(x, y, z)] for different values of £ and d. 

2.2.2 Computing ttie intersection of the d-neighborhoods 

In this section, we consider the problem of computing 
the intersection of the (^-neighborhoods Bd(x, y, z). 



Given three £-mers x, y, z and integer number d, we 
would like to list all of the £-mers in BJ^x, y, z). In this 
section we offer an algorithm FULLPRUNE for this task 
that runs in 0(£ + d\Bii{x, y, z)\) time. 

FULLPRUNE is the heart of algorithm PMS5. The 
idea of FULLPRUNE is as follows. We first represent B^ 
{x) as a tree 7d(x) in which each node is an £-mer in B^ 
{x) and its root is the £-mer x. The depth of fd{x) is d. 
We will describe 7d(x) in detail later. We traverse 7d(x) 
in a depth-first manner. At each node t during the tra- 
versal, we output t if t is in BJj) n BJ^z). We also check 
if there is a descendent t' of t such that t' is in Bj(y) n 
Bdiz). If there is no such descendent, we prune the sub- 
tree rooted at node t. We will show that checking the 
existence of such a descendent can be done quickly in 
0(1) time, later. Formally, Td{x) is constructed from the 
following rules. 

Rules to construct Td{x). 

1. Each node in Td{x) is a pair {t, p) where t = t[l] ... 
t[l] is an £-mer and p is an integer between 0 and £ 
such that t[p + I] ... t[Z] = x\p + I] ... x[t]. We refer 
to a node {t, p) as £-mer t if p is clear. 

2. Let t = t{l\ ... t[l] and t' = t'[l\ ... t'[l]. A node {t, 
p) is the parent of a node {t', p) if and only if 

(a) p >p. 

(b) t'[p'] ^ t\p'] (From Rule 1, t\p'] = x[p']). 

(c) t'[i] = t[i] for any i ^ p' 

3. The root of 7rf(x) is [x, 0). 

4. The depth of Td{x) is d. 

Clearly, each £-mer in B^{x) is uniquely associated 
with a node in fd(x) and vice versa. Figure 1 illustrates 
the tree 7^(1010) with alphabet S = {0, 1}. 

It is not hard to see that Td{x) has the following 
properties. 

Properties of Td{x). 

1. If a node [f, p) is a child of a node {t, p), then dn 
{x, t') - dnix, t) = dnit, t') = 1. As a result, if a node 
{t, p) at level h, then dnix, t) = h. 




Figure 1 7^(1010) with alphabet S = {0,1} The value of p at 
each node is the location of its shaded letter. For example, p = 2 at 
node 111 0, p = 3 at node 0000. 
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2. Consider two nodes {t, p) and {t', p) with t = t[l\ 
... t[l] and f = t'[l] ... t'[l]. Then {t', p) is a descen- 
dent of {t, p) if and only if: 

(a) p >p. 

(b) m ... t'\p\ = t[l] ... t\p]. 

(c) dnix, t') < d. 

Now we consider the subproblem of checking 
whether there is a descendent {t' , p') of (t, p) such 
that t' is in B^iy) n B^(z). Solving the subproblem is 
very crucial for FULLPRUNE because it will help us 
know beforehand for sure which nodes should be 
explored. The second property above is important to 
solve the subproblem. Let t = t[l] ... t[l], x = x[l] ... x 
[i],y = y[l] ... y[i] and z = z[l] ... z[i]. Let = t[l] ... t 
[p] and t2 = t[p + 1] ... t[l]. We define Xi, Xi, yi, j2> 
and Z2, similarly. Notice that X2 = ^2- Because of the 
second property t' must have the form t' = tiW, where 
w is an (£ - /7)-mer. Therefore, there is a descendent 
{t', p') of {t, p) such that t' is in B^iiy) n if and 

only if there is an (£ - p)-mer: w satisfying the follow- 
ing constraints: 

1. dnix, t') = dnixi, ti) + dH(x2, w) < d. 

2. d„{y, t') = dniji, k) + dniyi, w) < d. 

3. dniz, t') = dnizi, h) + i^z/fe. w) < d. 

We will show that the constraints can be expressed by 
an integer linear program of ten variables. Each location 
i in X2, j2 and Z2 is one of five types. 

♦ Type 1 (or Type aaa): X2li] = y2[i] = Z2[i]. 

♦ Type 2 (or Type aab): X2[i] = y2[i] * Z2U]. 

♦ Type 3 (or Type aba): X2li] = Z2[i] * y2V\- 

♦ Type 4 (or Type baa): X2[i] * yili] = Z2[i]. 

♦ Type 5 (or Type abc): X2[i] ^ J'2[i]. X2U] * Z2[i], J2 

[/] 7^ Z2[l\. 

Let «i (resp. «2, K3, «3, «4, and n^) be the number of 
locations of Type 1 (resp. Type 2, Type 3, Type 4, and 
Type 5). Given X2, J2 and Z2, nj is determined for 7 = 1 
... 5. Notice that «i + - + «5 = £ - p. 

Consider any (£ - p)-m.er: w = w[l] ... w[t - p\. We 
define the following variables. 

♦ Let N\^a be the number of locations / of Type 1 
such that w[/] = X'^i\. We should have N^^ < n^. 

♦ Let N2,a (resp. N2,b) be the number of locations i 
of Type 2 such that w[i\ = X2li] (resp. w[i] = Z2li]). 
We should have N2,a + N2,b ^ «2- 

♦ Let N-i^a (resp. N^,b) be the number of locations i 
of Type 3 such that w[i] = X2[i\ (resp. w[i] = j'2['])- 
We should have N^^a + ^z,b ^ «3- 



♦ Let A/4_a (resp. A^,^) be the number of locations / 
of Type 4 such that w[i] = y2li] (resp. w[i] = X2[i])- 
We should have A/4_a + N4_b ^ «4- 

♦ Let a (resp. A^s.^,, N^,c) be the number of loca- 
tions i of Type 5 such that w[i] = X2[i] (resp. w[i] = 
y2[i], w[i] = Z2U]). We should have N^^ + N^^b + A^s.c 

Now it is straightforward to calculate dH{x2, w) 
through the variables. The number of mismatches 
between X2 and w caused by the locations of Type 1 
(resp. Type 2, Type 3, Type 4, and Type 5) is «i - A/^i,^, 
(resp. «2 - N2,a, «3 - A/'s.a, «4 - A^4,6, and «5 - A/'s,^). 
Therefore, dH{x2, w) = (ki - Ni^a) + («2 - ^2,0) + («3 - 
N^.a) + («4 - N4_b) + («5 - A^5,a)- Similarly, d„{y2, w) = 

(«1 - + («2 - N2,a) + («3 - A/^3,fe) + («4 - + (Wg 

- Ns^b), and dH(z2, w) = («i - N^J + («2 - N2,b) + («3 - 
N3J + («4 - A?4,fl) + («5 - So the following integer 

linear program (ILP) expresses the constraints. 
Integer Linear Program (ILP). 

1- («1 - NiJ + («2 - N2J + («3 - + (K4 - N^^b) 

+ («5 - N5J < d - dilxx, ii). 

2. («i - A^i,«) + (n2 - A^2,fl) + («3 - A^3,&) + («4 - A/4,fl) 
+ («5 - Af5,fe) < - duiyx, ii). 

3. («1 - iVl,«) + («2 - A^2,fe) + («3 - A^3,«) + («4 - A4,a) 

+ («5 - A^5,c) s - d„{zx, h). 

4. A^i, « < «i. 

5. A/2, a+ N2,bi «2- 

6. A/3, a+ N3^b^ «3- 

7. A?4, a + A/4, < «4. 

8. A/g, a + A/g, + A/s,^ < Mg. 

9. All of the variables are non-negative integers. 

Clearly, there exists one or more w's satisfying the 
constraints if and only if the integer linear program has 
a solution. Notice that Wi + «2 + «3 + «4 + «5 = ^ - Z'- 
We can rewrite the first three constraints of the integer 
linear program as follows. 

L A/i, a+ N2,a+ A/3,a + N^.b + A/g,^ > I - p - d + dn 
(Xi, ti). 

2. Ni_ a + A/2, a + N3,b + N^,^ + N^.b > I - p - d + dH 
(yi, hi 

3. A/i, + A/2, 6 + A/3,a + A/4,a + Ns,c >t - p - d + dn 
(zi, h). 

Because the integer linear program has only ten vari- 
ables, checking whether it has a solution can be done in 
0(1) time. Notice that the integer linear program only 
depends on eight parameters «i, ... «g, d - dnixi, h), d - 
dniyit and d - dj-[{zi, ti). The first five parameters 
are in the range [0, £] and the other parameters are 
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in the range [0, ... d]. Therefore, we will store the results 
of all possible integer linear programs in a 8-dimen- 
sional table of size (£ + l)^{d + 1)^ to speedup the 
checking time for the integer linear programs during the 
traversal on the tree in FullPrune. Notice that we only 
need to compute the table once before FULLPRUNE is 
executed, and reuse it as many times as needed. The 
pseudocode of FULLPRUNE is given below. 
Algorithm FULLPRUNE 

1. Compute dnix, y) and dnix, z). 

2. Compute «i, n2, n^, «4 and for each p = 0... (£ - 1). 

3. Traverse the tree Td[x) in a depth-first manner. At 
each node {t, p), do the following steps. 

(a) Incrementally compute dnix, t), dniy, t), and 
dniz, t) from its parent. 

(b) Incrementally compute dnixi, ti), dniji, fi), 
and dnizi, h) from its parent. (Notice that ti = t 
[1] ... t[p], xi = x[l] ... x[p], ji = y[l] ... y[p] and 
Xl = z[l] ... z[p]). 

(c) If dnix, t) < d, dffiy, t) < d and (i//{z, t) < d, 
then output t. 

(d) Solve the integer linear program (ILP) with 
parameters «i, «2i «3. n^, ns, i - p - d + dnixi, 
ti), t - p - d + dniyi, ti), and ^ - p - d + dnizi, 

h). 

(e) If di-[{x, t) > d and/or the ILP does not have a 
solution, then prune the subtree rooted at node 
{t, p). Otherwise, explore its children. 

Theorem 2.2. Given three l-mers x, y and z, FULL- 
PRUNEcow/?Mfes BJ^x, y, z) in 0{£ + d\Bj{x, y, z)\) time. 

Proof. From the discussion above, FULLPRUNE out- 
puts all of the £-mers in B^{x, y, z). Now let us analyze 
its run time. In the pseudocode of FullPrune, step 1 and 
step 2 take 0(£) time. We will show that step 3 takes at 
most 0{d\B^{x, y, z)\) time, that will complete our 
proof. Since in 7d(x) a node and its parent differ at 
exactly one location, step 3a and step 3b take at most O 
(1) time. It is easy to see that the other steps inside step 
3 (from step 3c to step 3e) also take 0(1) time. There- 
fore, FULLPRUNE spends at most 0(1) time at each 
node it visits. As a result, the run time of step 3 is pro- 
portional to the number of the visited nodes. We will 
argue that the number of visited nodes is no more than 
d\Bci{x, y, z)\. Consider the tree 7" consisting of all the 
nodes visited by FullPrune. Obviously, 7d(^) contains 7~- 
Because of the property of the integer linear program, 
every leaf in T is an element in B^iix, y, z). Therefore, 
the number of leaves in 7" is at most B^ix, y, z). On the 
other hand, in any tree the number of nodes is no more 
than its depth times the number of its leaves. Since 
Td{x) contains f, the depth of T' is less than or equal to 
the depth of 7d[x), which is equal to d. Hence, the 



number of nodes in T, which is equal to the number of 
nodes visited by FullPrune, is at most d\Bj(x, y, z)\. 

We conclude this section with a remark that our algo- 
rithm FULLPRUNE can be generalized as follows. Right 
now we use the computation of the common d-neigh- 
borhood of three £-mers as the basic step. This can be 
generalized so that the basic step is that of computing 
the common (^-neighborhood of k £-mers (for any value 
of k < n). 

2.3 Extended PMS5 for Solving the q-PMS Problem 

In this section, we consider a generalized version of the 
PMS problem called the ^-PMS Problem (see e.g., [22]). 
In the ^-PMS problem, we relax the constraints on the 
motifs. An £-mer is a motif if there are at least q 
sequences 5; in S such that dnix, Si) < d. Apparently, the 
^-PMS problem becomes the PMS problem if ^ = «. In 
practice, the q-PMS problem is a more realistic model 
of motifs since these motifs usually appear in some of 
the given sequences, instead of appearing in all of them. 

We can extend the algorithm PMS5 to solve the q- 
PMS problem by exploiting the heart of PMS5, i.e., the 
algorithm FULLPRUNE that computes B^{x,y, z). One 
simple and straightforward way to extend PMS5 for the 
^-PMS problem is as follows. We consider every tuple 
of sequences (s„ Sj, s^), 1 < i <j <k < n. For each tuple 
{Si, Sj, S)t), we compute B^(x, y, z) where x, y, and z are in 
Si, Sj and 5^, respectively. For each £-mer t in B^{x, y, z), 
we check whether there are at least q-?> sequences Sp in 
5\{s,, 5j, Sfe} such that dnit, Sp) < d.li t satisfies this con- 
straint, we output f as a motif. The psuedocode is pro- 
vided below. 

Extended Algorithm PMS5 for ^-PMS 

1: for each tuple of sequences (s;, Sj, s^i, where 1 < i 
<j <k < n do 

2: for each tuple (x, y, z) of £-mers where x & i Si,y 
e e Sj, and z e t Sf; do 

3: Compute Ba(x, y, z) using FULLPRUNE. 

4: for each t e BJ^x, y, z) do 

5: if there are at least q-?) sequences 

Sp e 5\{Si, Sj, Sfe} such that df^t, Sp) < d then 

6: output t. 

7: end if 

8: end for 

9: end for 

10: end for 

The two following theorems are immediate: 
Theorem 2.3. The worst run time of the above algo- 
rithm is O [n'^m^dN' {I, d)). 
Theorem 2.4. The expected run time of the above 

algorithm is O (^n'^m^d4^p^ where 
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2.4 Challenging Instances for q-PMS 

The challenging instances for ^-PMS have to be defined 
appropriately. For every value of £, we can define a cor- 
responding challenging instance with a relevant value 
for d. We define the challenging instance corresponding 
to a given value of £ to be the pair (£, d) if d is the 
smallest value for which the expected number of (£, d)- 
motifs that occur by random chance is at least 1. In fact 
the same definition is used for the PMS problem as 
well. However, the identification of such instances is 
slightly different. We could identify the challenging 
instances for ^-PMS as follows. Let Si, S2, —, S„ be the 
given input strings. Consider a random £-mer w. Let S 
be any one of the input strings and x be an £-mer in S. 

Probability that the Hamming distance between w and 
X is < d is P= J2io (f) Probability that the 

Hamming distance between w and x is >d is (1 - P). 
Probability that the Hamming distance between w and 
each £-mer of 5 is >d is Q' = (1 - p)^""+\ Here we 
assume that the £-mers of S are independent, which is 
clearly incorrect. A similar assumption has been made 
in prior analyses (see e.g., [3]) and in practice conclu- 
sions made using such analyses seem to hold nearly. 
Probability that 5 has at least one £-mer x such that the 
Hamming distance between w and ;cis<(iisQ=l- 
Q'. If the Hamming distance between w and x is < d, 
call x as an instance of w. 

Probability that w has at least one instance in at least 

q of the n input strings is R = ^f,^ (^) Q'(l - Q)""'. 
Therefore, the expected number of motifs that occur by 
random chance is 4^7?. Table 2 displays the expected 
number of random motifs corresponding to various 
values of € and d with n = 20, m = 600 and q = 10. 
Challenging instances are shown in bold face. 



Table 2 The expected number of random motifs for q- 
PMS corresponding to various values of £ and d with n = 
20, m = 600 and q = 10 



1 


d 


Expected Number of Random Motifs 


9 


2 


1.599 


9 


1 


0.159 


11 


2 


1.424 


11 


1 


8.643 X 10^^ 


13 


3 


22.090 


13 


2 


1.530 X 10"' 


15 


4 


154 


15 


3 


7150 X 10"^ 


17 


5 


640 


17 


4 


5.277 X 10"*^ 


19 6 1883 


19 


5 


8.504 X 10"*^ 



Challenging instances of q-PMS are shown in bold face. 



3 Results and Discussion 

3.1 Performance of PMSS on the challenging instances 

In this section, we show the performance of PMS5 on 
the challenging instances as described in Section 1. We 
also compare the performance of PMSS with that of 
other well-known exact algorithms such as Pampa [10], 
PMSPrune [22], Voting [20], and RISSOTO [21]. Algo- 
rithms for planted motif search are typically tested on 
random input datasets. Any such dataset will consist of 
20 random strings each of length 600 (« = 20, m = 600). 
A random motif of length £ is planted at a random posi- 
tion in each of the strings, mutating it in exactly d 
places. We test the algorithms for varying £ and d 
values. In particular, we have employed the following 
challenging instances: (13, 4), (15, 5), (17, 6), (19, 7), 
(21, 8), and (23, 9). 

To have a fair comparison, we have run all of the 
algorithms on the same machine. The configuration of 
the machine is Windows XP Operating System, Dual 
Core Pentium 2.4GHz CPU with 3GB RAM. PMSS is 
written in C language. Pampa, PMSPrune and RISSOTO 
were also written in C language. Only Voting was writ- 
ten in C-H-. All of the algorithms were compiled using 
Microsoft Visual C-1--1- 6.0 Compiler. 

Table 3 shows the performance of the algorithms on 
the challenging instances. In Table 3, the letter '-' means 
that the corresponding algorithm either uses too much 
memory or takes too long on the challenging instance. 
In other words, the algorithm cannot solve the challen- 
ging instance in the experimental settings. We see that 
PMSS outperforms the other algorithms on all of the 
challenging instances except on (13,4) and notably 
PMSS is the only algorithm that can solve the two chal- 
lenging instances (21, 8) and (23, 9). PMSS takes more 
time than Pampa, PMSPrune and Voting on (13,4) 
because it takes an additional amount of time to load 
the table that stores the results of the integer linear pro- 
grams. This process takes about SO seconds. On the lar- 
ger challenging instances, this amount of time is 
negligible. 

While comparing PMSS and PMSPrune, we notice an 
interesting fact that as the challenging instance increases 
in size, the ratio between their run times increase expo- 
nentially. In particular, the ratio is roughly 2,4, and 8 on 



Table 3 Time comparison on challenging instances 



Algorithm 


(13,4) 


(15,5) 


(17,6) 


(19,7) 


(21,8) 


(23,9) 


PMSS 


117s 


4.8 m 


21.7 m 


1.7h 


9.7h 


54h 


Pampa 


35s 


6 m 


40 m 


4.8h 






PMSPrune 


45s 


102 m 


78.7 m 


152h 






Voting 


104s 


21.6 m 










RISOTTO 


772s 


1064 m 
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the challenging instances (15,5), (17,6), and (19,7), 
respectively. This fact perhaps explains why PMS5 can 
solve the challenging instances (21, 8) and (23, 9) but 
PMSPrune cannot. If this observation is true in general, 
PMSPrune will probably take about 16 x 9.7 = 155.2 
hours on the instance (21, 8), and 32 x 54 = 1728 hours 
on the instance (23, 9). 

Notice that the run time of PMS5 does not include 
the time for building the ILP table. It takes 1.5 hours 
and 500MB to build and store the ILP table for £ = 23 
and d = 9. 

3.2 PMS5 on real data: predicting transcript factor- 
binding sites 

In this section, we discuss how to use algorithm PMS5 
to find transcript factor-binding sites in the real data 
provided in [23]. The real data is broadly used to test 
many existing algorithms [23], [11], [22], [3]. Each data- 
set in the real data is comprised of DNA sequences with 
verified transcript factor-binding sites. The datastes are 
from many species including mouse, human and yeast. 

We have used the algorithm PMS5 to find transcript 
factor-binding sites as follows. For any given dataset, we 
have run PMS5 with I = 21, d = 8, and obtained a set 
of motifs. Some of these motifs could be spurious hits. 
Hence, we need a scoring scheme to rank the motifs. 
We have used the simple scoring function Zi<i<„ dn {M, 
Si), where dniM, s,) is the hamming distance between 
motif M and sequence 5,. We take the motif with the 
lowest score and then predict transcription factor- 



binding sites based on it. Notice that we have only used 
one value for £ (namely, 21) because smaller values of i 
have been tested in [22]. 

We provide the detailed results in Table 4. In Table 4, 
the first column is the name of the dataset. The dataset 
is from mouse (resp. human) if the dataset's name starts 
with "mus" (resp. "hm"). The second column is the 
motif with the lowest score produced by algorithm 
PMS5. The third column is the verified transcription 
factor-binding sites that overlap with the predicted tran- 
scription factor-binding sites at least 60% of the motif 
length. We find that there are 10 out of 37 datasets in 
which the predicted transcription factor-binding sites 
are correct. In particular, one of the verified transcrip- 
tion factor-binding sites in dataset hm22r contains the 
predicted transcript factor-binding site. Therefore, we 
conclude that the results in Table 4 once again reaffirm 
the accuracy of the PMS model. In practice one could 
use PMSPrune (for values of £ up to 19) and PMS5 (for 
values of £ larger than 19) together to identify motifs. In 
this case the sensitivity will be better than using 
PMSPrune alone (or any of the algorithms reported in 
[24,25]). 

3.3 Performance of Extended PIV1S5 on the q-PMS 
challenging instances 

In this section, we show the performance of Extended 
PMS5 on the ^-PMS challenging instances as described 
in Section 2.4. The experiment setting is the same as 
that in Section 3.1. Any dataset will consist of 20 



Table 4 PMS5 on real datasets: predicting transcript factor-binding sites 


Dataset 


Best motif found by PIV1S5 


Matched binding sites at: 


musOSr 


AGAGGAAAAAAAAAAGGAGAG 


seq 1 : GGAAAAACAAAGGTAATG 


mus07r 


GCTGCCCACCCTCTGCAACCC 


seq 4: CCCAACACCTGCTGCCTGAGCC 


musl 1 r 


AGGGCGGGGGGGGGAGCGGGG 


seq 2: GCCGGCGGGGTGGGGCTGAG 
seq 3: GGGGGGGGGGGGGGGGC 
seq 4: GTGGGGGCGGGGCOT 
seq 9: GAACAGGAAGTGAGGCGG 


hm03r 


AAAAGAAAAAAAAATAAACAA 


seq 1 : GGGGTGTTATrCAAGCAAAAAAAATAAATAAATACCTATGCAATAC 
seq 2: GGATGTTACACAAGCAAACAAAATAAATATCTGTGCAATAT 
seq 3: TGGGTGTTATATGAGCAAACAAAATAAATACCTGTGCAACAT 


hm08r 


CAGCGTGCAGTCCCCrrCATC 


seq 10:TATGGTCATGACGTCTGACAGAGC 


hm19r 


CCCCCTTCCACCACCCACAGA 


seq 2: CAaTTTAGCTCCTCCCCCCA 


hm20r 


CCTCCrrCCTCCCCCTCCCCC 


seq 10: TCCTCCCCACCTTCCCCACCCTCCCCACCCTCCCCATAAGCGCCCCTCCCG 
seq 11: GCAAACrCCGCCTCCCCCAA 
seq 14: GTCCCTCCTCCTCCCGCCC 


hm22r 


GGACACGGCAGAGCCTGGGGA 


seq 4: GAGGCAGACCACGTGAGAGCCTGGCCAGGCCTTCC 


hm24r 


CGCCTGCGCCCCGCCCCGCCC 


seq 2: CCCCGCCCCGCGCTCCCC 


hm26r 


CCCCCCGCCTCCCGCrCCCAG 


seq 3: CCCCGCCrCAGGCTCCCGGGG 

seq 7: CTCAGCCTGCCCCTCCCAGGGATTAAG 

seq 8: GCGCCGAGGCGTCCCCGAGGCGC 
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Table 5 Run time of Extended PMS5 on q-PMS 
challenging instances 



q-PMS challenging instance 


Run time 


(9,1) 


78s 


(11, 2) 


748 m 


(13, 3) 


1 843 m 


(15,4) 


1.39h 


(17, 5) 


15.93h 


(19,6) 



random strings each of length 600 (« = 20, m = 600). 
We choose the parameter q = 10, which requires motifs 
to appear in at least 50% of input sequences. Note that 
this choice of q corresponds to the worst case run time 
(from among all possible values of q). Table 5 shows the 
run time of Extended PMS5 on the q-PMS challenging 
instances. Extended PMS5 can solve q-PMS challenging 
instances (17, 5) in 15.9 hours and it fails to solve q- 
PMS challenging instances (19, 6). 

4 Conclusions 

In this paper we have presented an efficient exact algo- 
rithm for the (£, li) -motif search problem. This algo- 
rithm is more efficient than any known exact 
algorithm for PMS. In particular, using this algorithm 
we can solve the challenging instances (21, 8) and (23, 
9). No prior exact algorithms could solve these 
instances. Therefore, we hope that PMS5 will help 
biologists discover longer motifs in future. Our algo- 
rithm is based on some novel ideas that will be of 
independent interest to solve PMS and other variations 
of the motif search problem. One of the basic ideas we 
employ is that of computing the common li-neighbor- 
hood of three £-mers. This is done using an integer 
programming formulation. An open problem will be to 
exploit this idea to further improve the performance of 
our algorithm. One possible direction is to use a basic 
step where the li-neighborhood of k £-mers is com- 
puted (for some relevant value of k). We have 
extended our algorithm to solve the q-PMS problem as 
well. Challenging instances for the ^-PMS problem 
have been defined and computed. Our extended algo- 
rithm can solve the following i^-PMS challenging 
instances: (9,1), (11, 2), (13, 3), (15, 4), and (17, 5). In 
comparison, the exact algorithms MITRA, RISOTTO, 
and Voting also can only solve challenging instances 
up to d = 5 (but for the version where the motifs 
occur in all the input strings). 
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